In this notebook we will talk about quantization and oversampling and we will do so by taking a trip down memory lane to revisit the early days of sound effects in video games and home computers. We'll start from monophonic square waves, introduce polyphony by way of pulse-width modulation and finish with the basics of sigma-delta quantization.

%matplotlib inline
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import scipy.signal as sp
from scipy.io import wavfile
import IPython
plt.rcParams["figure.figsize"] = (14,4)
In the analog world the simplest "musical" waveform is the sinusoidal oscillation, since sinusoids describe the oscillatory behavior of physical objects such as strings, rods and air columns in pipes.

In the digital world, on the other hand, the simplest way to create a sound is to drive a loudspeaker with a two-level signal that alternates between two fixed voltage values. The resulting waveform is a square wave, which is the prototypical signal generated by an astable digital oscillator such as the simple circuit based on logic gates shown here:

The samples of a discrete-time square wave take values over a set of only two possible values (high and low, or +1 and -1), so each sample can be encoded using a single bit; this is the smallest possible quantization granularity for a digital signal.
In the first digital consumer devices (such as early video games and home computers), the interfaces to the outside world were kept as simple as possible in order to minimize hardware costs. Early processors had only a few output data lines (usually 8), each of which could be independently driven to a high or low level by setting or clearing the appropriate bit in an internal register. For the audio interface it was common to reserve a single data line and drive the loudspeaker directly with the internal bit value: the only waveforms sent to the loudspeaker were therefore square waves. This simple setup had two great advantages:
A square wave with frequency $\omega_0$ can be easily generated mathematically by thresholding a sinusoidal function:
$$ x[n] = \begin{cases} +1 & \mbox{if $\sin(\omega_0 n) \ge 0$} \\ -1 & \mbox{if $\sin(\omega_0 n) \lt 0$} \end{cases} $$
def square_wave_exact(w, N):
    return np.where(np.sin(np.arange(0, N) * w) >= 0, 1, -1)
N = 100
w = .13 * np.pi
plt.plot(np.sin(np.arange(0, N) * w), 'g', linewidth=3)
plt.stem(square_wave_exact(w, N))
plt.ylim(-1.2, 1.2);
While this is mathematically accurate, it requires the use of trigonometric computations, which were usually too expensive in terms of CPU cycles. On the other hand, consider the period (in samples) of the same square wave:
$$ P = \frac{2\pi}{\omega_0} $$
If $\omega_0 = 2\pi / P$ for $P \in \mathbb{N}$, the period is equal to an integer number of samples and the square wave can be synthesized simply as
$$ x[n] = \begin{cases} +1 & \mbox{if $(n \mod P) < P/2$} \\ -1 & \mbox{otherwise} \end{cases} $$
If $2\pi/\omega_0$ is not an integer, we can simply use $P=\mbox{round}(2\pi/\omega_0)$. The rounding operation will detune the waveform's pitch with respect to the nominal frequency; as the frequency increases, the relative error in the rounded period grows proportionally and the played notes will sound more and more out of tune; we will hear an example later.
def square_wave_cheap(w, N):
    p = np.round(2 * np.pi / w)
    return np.where((np.arange(0, N) % p) >= (p / 2), -1, 1)
# we plot the correct square wave in red to show the detuning of the "cheap" version
plt.plot(np.sin(np.arange(0, N) * w), 'g', linewidth=3)
plt.plot(square_wave_exact(w, N), 'red')
plt.stem(square_wave_cheap(w, N))
plt.ylim(-1.2, 1.2);
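To get a feel for the size of the detuning, we can quantify the pitch error of the "cheap" generator in cents (hundredths of a semitone); the helper `detuning_cents` below is just a quick sketch using the standard $1200\log_2$ conversion from a frequency ratio to cents:

```python
import numpy as np

def detuning_cents(w):
    # the generator uses a rounded period instead of the exact (fractional) one
    P = np.round(2 * np.pi / w)
    w_actual = 2 * np.pi / P
    # 100 cents = one semitone; convert the frequency ratio to cents
    return 1200 * np.log2(w_actual / w)

# the same note (A4 = 440 Hz) at two sampling rates: the error
# grows as the rate (and hence the period in samples) shrinks
for rate in [32000, 8000]:
    w = 2 * np.pi * 440 / rate
    print(rate, 'Hz:', round(detuning_cents(w), 2), 'cents')
```

At 32KHz the error stays within a few cents, while at 8KHz it exceeds the roughly 10-cent threshold usually quoted as audible, which is consistent with what we will hear below.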
The second advantage of a square wave is that it is a 1-bit signal and 1-bit signals can be applied directly to a loudspeaker via a simple zero-order interpolator. Basically, for 1-bit signals, the D/A converter can even be omitted entirely and the CPU data line can be used to drive the speaker. In the figure below, you can see the relevant portion of the schematics of the ZX Spectrum, one of the most popular home computers of the 80s; the loudspeaker is directly connected to pin 28 of the data bus:

OK, time to play some simple 1-bit music: for this we will need a function to convert pitch names to frequencies (we used the note_to_freq function in an earlier notebook, so we will just load it from an auxiliary file) and a parser to read a sequence of notes and produce the appropriate square wave segments.
You should be able to recognize the tune...
from music import note_to_freq

def play_notes(melody, time_scale=1, rate=32000, wave_engine=square_wave_cheap):
    # melody is a tuple of pairs, each pair containing the pitch and the duration
    # of each note; time_scale gives the base length of a note of unit duration
    s = []
    for note in melody:
        # digital angular frequency of the note
        f = 2 * np.pi * note_to_freq(note[0]) / float(rate)
        # duration of the note in samples
        N = int(note[1] * rate * time_scale)
        if f > 0:
            s = np.concatenate((s, wave_engine(f, N)))
        else:
            # zero frequency indicates a rest: output a constant level
            s = np.concatenate((s, np.ones(N)))
    return s
tune = (('B4', 2), ('B5', 2), ('F#5', 2), ('D#5', 2), ('B5', 1), ('F#5', 3), ('D#5', 4),
('C5', 2), ('C6', 2), ('G5', 2), ('E5', 2), ('C6', 1), ('G5', 3), ('E5', 4),
('B4', 2), ('B5', 2), ('F#5', 2), ('D#5', 2), ('B5', 1), ('F#5', 3), ('D#5', 4),
('D#5', 1), ('E5', 1), ('F5', 2), ('F5', 1), ('F#5', 1), ('G5', 2), ('G5', 1),
('G#5', 1), ('A5', 2), ('B5', 4))
SF = 24000
jingle = play_notes(tune, time_scale=0.06, rate=SF)
IPython.display.Audio(jingle, rate=SF)
We can experience the limitations of the "cheap" square wave generator if we lower the sampling rate, so that the relative error in the period becomes bigger (this should also sound familiar to those who played with early video games):
SF=8000
IPython.display.Audio(play_notes(tune, time_scale=0.06, rate=SF), rate=SF)
We could try to ameliorate the detuning by using the precise square wave generator instead; if we do so, the pitches are now correct but we hear artifacts due to the fact that most notes span a fractional number of periods (this "sound" as well should be familiar to vintage gamers):
SF=8000
IPython.display.Audio(play_notes(tune, time_scale=0.06, wave_engine=square_wave_exact, rate=SF), rate=SF)
If you recognized the tune, you also noticed that something is missing: indeed, the original PacMan jingle is not monophonic, it also contains a simple bass line. We can generate the bass as a separate signal and add it to the treble, to recover the full PacMan experience:
tune_bass = (('B2', 6), ('B3', 2), ('B2', 6), ('B3', 2), ('C3', 6), ('C4', 2), ('C3', 6), ('C4', 2),
('B2', 6), ('B3', 2), ('B2', 6), ('B3', 2),
('F#3', 4), ('G#3', 4), ('A#3', 4), ('B3', 4))
SF = 24000
pacman = jingle + play_notes(tune_bass, time_scale=0.06, rate=SF)
IPython.display.Audio(pacman, rate=SF)
The problem, however, is that the sum of two square waves is no longer a two-level square wave and therefore it cannot be represented with 1 bit per sample. We can see this if we plot the polyphonic PacMan waveform in detail:
plt.plot(pacman[480:900])
plt.ylim(-2.2, 2.2);
You can see that the waveform now takes on three possible values: -2, 0, and 2. Since we will be checking more waveforms in the future, let's introduce some quick test functions:
def distinct_values(x):
    v = set()
    for k in x:
        v.add(k)
    return v

def is_two_level(x):
    v = distinct_values(x)
    print('the signal is', '' if len(v) == 2 else 'NOT', 'two-level; values: ', end='')
    print(sorted(list(v)))
# the monophonic tune is two-valued...
is_two_level(jingle)
# but the polyphonic tune is not!
is_two_level(pacman)
At this point, a rather desperate approach could be to just ignore the extended output range and quantize the polyphonic signal to 1 bit:
sq = np.where(pacman >= 0, 1, -1)
print(distinct_values(sq))
IPython.display.Audio(sq , rate=SF)
Although the sound is definitely interesting (and although some old games did sound like that!), the result is quite poor... We need to be smarter.
Encoding a polyphonic tune at one bit per sample relies on four fundamental intuitions, which we examine in turn:
So far we have considered only balanced square waves: if the period of the wave is $P$ samples, the signal is +1 over the first $P/2$ samples and -1 over the second half of the period. However, we can define asymmetric square waves of period $P$ with $D$ samples equal to +1 and $P-D$ samples equal to -1. The ratio $C = D/P$ is called the duty cycle of the square wave; an important fact for what follows is that the average value of one period of a square wave with duty cycle $C$ is $2C-1$. Clearly, for a balanced square wave $C=0.5$ and the average value is zero.
def square_wave_vdc(P, D, N):
    # build a square wave with a variable duty cycle
    # one period: D samples at +1, P - D samples at -1
    chunk = -np.ones(P)
    chunk[:D] = 1
    # full wave
    sw = np.tile(chunk, int(N / P))
    return sw

def show_sqw_vdc(P, D):
    plt.stem(square_wave_vdc(P, D, 30))
    plt.ylim(-1.2, 1.2)
    plt.xlim(-2, 32)
    plt.title('duty cycle: ' + str(D) + '/' + str(P))
plt.subplot(1, 3, 1)
show_sqw_vdc(4, 2)
plt.subplot(1, 3, 2)
show_sqw_vdc(4, 3)
plt.subplot(1, 3, 3)
show_sqw_vdc(5, 3)
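As a quick numerical sanity check, the sample mean of one full period should match the $2C-1$ formula exactly; the snippet below (a self-contained sketch with its own one-period helper) verifies this for a few duty cycles:

```python
import numpy as np

def one_period(P, D):
    # one period of a square wave: D samples at +1, then P - D samples at -1
    chunk = -np.ones(P)
    chunk[:D] = 1
    return chunk

for P, D in [(4, 2), (4, 3), (5, 3), (10, 7)]:
    C = D / P  # duty cycle
    # the empirical mean and the 2C - 1 prediction should coincide
    print(P, D, np.mean(one_period(P, D)), 2 * C - 1)
```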
A lowpass filter always performs some kind of averaging; if we lowpass filter a square wave with duty cycle $C$, the output will tend to, and oscillate around, the average value of one period of the wave, i.e. the value $2C-1$. We can test this easily by using a leaky integrator:
def square_wave_lp(P, D):
    # signal length
    N = 200
    # build one period of the square wave
    chunk = -np.ones(P)
    chunk[:D] = 1
    # build the full wave
    sw = np.tile(chunk, int(N / P))
    # now filter the square wave with a leaky integrator
    alpha = 0.98
    swf = sp.lfilter([1 - alpha], [1, -alpha], sw)
    # average value over one period
    A = 2.0 * float(D) / float(P) - 1
    plt.plot(sw, 'green', swf, 'red', [0, N], [A, A], 'blue')
    plt.ylim(-1.2, 1.2)
    plt.title('duty cycle: ' + str(D) + '/' + str(P) + ', avg=' + str(A))
plt.subplot(1, 3, 1)
square_wave_lp(4, 2)
plt.subplot(1, 3, 2)
square_wave_lp(4, 3)
plt.subplot(1, 3, 3)
square_wave_lp(5, 3)
Averaging (or lowpass filtering) a two-level square wave seems to be a viable way to generate intermediate values between -1 and 1 using a two-level signal. The technique goes under different names according to how you look at it: if the focus is on the correspondence between duty cycle and averaged value, the name of choice is pulse width modulation, since we're using the shape of the basic square wave pulse to generate the output. Another common term is dithering, which is borrowed from digital imaging; the origin of the name is extremely fascinating in and of itself.
So the question now is: where do we get a lowpass filter?
A loudspeaker, especially a cheap one, will not be able to vibrate effectively at the low and high ends of the audible spectrum because of the physical limitations of its internal mechanics; at best the magnitude response will be flat up to 20KHz and decay rapidly afterwards. The lowpass characteristic of the loudspeaker is usually accentuated by the presence of limiting capacitors (as in the Spectrum schematics above) that prevent overloading the speaker's coil.
So a lowpass is naturally present when driving a speaker with a square wave. The first spectral line of a square wave of period $P$ samples at sampling frequency $F_s$ is at $F_s/P$, independently of the duty cycle. If we want to make sure that we don't hear any high-frequency artifacts when using the lowpass trick, we need to use square waves whose first harmonic falls outside the frequency range of the loudspeaker. This is achieved by using oversampling.
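We can verify numerically that the first spectral line sits at $F_s/P$ regardless of duty cycle: in the sketch below (the helper `first_line_bin` is introduced here for illustration), we take $N/P$ periods of the wave, discard the DC component, and locate the strongest FFT bin, which should always be bin $N/P$:

```python
import numpy as np

def first_line_bin(P, D, N=800):
    # N/P exact periods of a square wave with duty cycle D/P
    chunk = -np.ones(P)
    chunk[:D] = 1
    sw = np.tile(chunk, N // P)
    S = np.abs(np.fft.rfft(sw))
    S[0] = 0  # discard the DC component (nonzero for unbalanced duty cycles)
    # index of the strongest remaining spectral line
    return int(np.argmax(S))

# with N = 800 and P = 8, the first line is at bin 100 = N/P
# for every duty cycle, i.e. at frequency Fs/P
print([first_line_bin(8, D) for D in (2, 3, 4)])
```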
Even in early electronic devices, the CPU's clock was in the MHz range; that meant that the bit driving the loudspeaker could be flipped many times faster than the rate required to obtain a sound in the audible range. For instance, the ZX Spectrum's CPU (a Zilog Z80) ran at 3.5MHz; omitting boring details, that means one could output 1-bit samples at about 200KHz, which represents a tenfold oversampling with respect to the bandwidth of the loudspeaker. In other words, one could use square waves with varying duty cycles with periods of up to 10 samples.
OK, let's apply the four concepts above to the PacMan jingle. We'll be working with contemporary PC soundcards, so we can't achieve the high data rates we just discussed. However, assuming the highest sampling rate in your system's soundcard is 96KHz and seeing how we synthesized the PacMan jingle at 24KHz, we can obtain an upsampling factor of 4.
Let's take the three-level jingle and replace each sample with four samples: a value of +2 becomes four samples at +1, a value of -2 becomes four samples at -1, and a value of 0 becomes the alternating pattern +1, -1, +1, -1 (whose average is zero).
With this, we'll obtain a signal that we can play at 96KHz; since the "fast" alternating segments are a wave with $P=2$, the corresponding first harmonic is at 48KHz and we should not be able to hear it.
pacman1bit = np.ones(len(pacman) * 4)
for k in range(0, len(pacman)):
    if abs(pacman[k]) < 0.5:  # floating-point tolerance around zero
        pacman1bit[k*4:(k+1)*4] = [1, -1, 1, -1]
    else:
        pacman1bit[k*4:(k+1)*4] = np.sign(pacman[k]) * np.ones(4)
is_two_level(pacman1bit)
OK, the signal is two-level. We can compare the original and the 1-bit version:
a = 600
b = 700
plt.plot(np.arange(a,b), pacman[a:b], 'blue',
np.arange(a*4,b*4)/4, pacman1bit[a*4:b*4], 'red')
plt.ylim(-2.2, 2.2);
And finally we can play it. And it works!!!
IPython.display.Audio(pacman1bit, rate=96000)
Here, for reference, is the original once again:
IPython.display.Audio(pacman, rate=24000)
We can try and generalize the above approach to more than two voices, although we won't be able to go very far using the standard Python audio interface.
The first thing to notice is that, if we sum $N$ square waves together, the resulting signal will contain up to $N+1$ levels with values in the following set:
$$ -N, \, -(N-2), \, \ldots, \, (N-2), \, N $$
To map this onto a 1-bit signal we will need an oversampling factor of at least $N$, since a square wave period of length $N$ allows for $N+1$ different duty cycles (and therefore, when averaged, $N+1$ output values).
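As a sketch of the mapping (the helper `level_to_period` is introduced here for illustration): a level $\ell \in \{-N, \ldots, N\}$ can be rendered by a period of $N$ samples with $(N+\ell)/2$ of them at +1, so that each period averages exactly to $\ell/N$:

```python
import numpy as np

def level_to_period(level, N):
    # one period of N samples encoding a level in {-N, -N+2, ..., N}:
    # (N + level) / 2 samples at +1, the remaining ones at -1
    high = (N + level) // 2
    chunk = -np.ones(N)
    chunk[:high] = 1
    return chunk

N = 4  # four voices
for level in range(-N, N + 1, 2):
    # the mean of each period is level / N
    print(level, np.mean(level_to_period(level, N)))
```

The generic dithering function below does essentially this, with the sign transitions distributed evenly over the period instead of grouped at the start.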
Let's try and write a function that does this generic mapping for us and apply it to a famous bit of 4-part music. Since we have four voices and our maximum sampling rate is 96KHz, we need to synthesize it at 24KHz; we will use the "expensive" square wave generator to minimize detuning.

bdbm_1 = (('Bb4', 4), ('Eb5', 6), ('F5', 2), ('D5', 8), (' ', 4),
('Eb5', 4), ('Ab4', 4), ('Ab4', 4), ('Ab4', 8), ('G4', 4),
(' ', 2), ('Bb4', 2), ('D5', 2), ('Bb4', 2), ('A4', 2), ('Bb4', 2),
('F4', 2), ('Bb4', 2), ('D5', 2), ('Bb4', 2), ('A4', 2), ('Bb4', 2),
('Eb4', 4), ('C5', 6), ('D5', 1), ('Eb5', 1),
('D5', 3), ('C5', 1), ('Bb4', 3), ('C5', 1), ('F4', 3), ('A4', 1),
('Bb4', 12), )
bdbm_2 = (('G4', 4), ('G4', 4), ('A4', 4), ('Bb4', 8), (' ', 4), ('Bb4', 4),
('F4', 4), ('F4', 4), ('F4', 8), (' ', 4), ('G4', 12), ('F4', 12),
(' ', 12), (' ', 8), ('Eb4', 4), (' ', 12),)
bdbm_3 = ((' ', 12), ('F4', 8), (' ', 4), ('Eb4', 4), ('F4', 4), ('C3', 4),
('Bb3', 4), ('D4', 4), ('Eb4', 4), (' ', 12), ('D4', 12), ('Bb3', 4),
('F4', 4), ('A4', 4), ('Bb4', 4), ('G4', 4), ('Eb4', 4), ('D4', 12), )
bdbm_4 = (('Eb3', 4), ('C3', 4), ('F3', 4), ('Bb2', 4), ('Bb3', 4), ('Ab3', 4),
('G3', 4), ('F3', 4), ('Eb3', 4), ('D3', 4), ('Bb2', 4), ('Eb2', 4),
('E2', 4), ('E2', 4), ('E2', 4), ('F2', 4), ('F2', 4), ('F2', 4),
('G2', 4), ('A2', 4), ('F2', 4), ('Bb2', 4), ('Eb2', 4), ('F2', 4), ('Bb2', 12), )
SF=24000
s = play_notes(bdbm_1, time_scale=0.2, rate=SF, wave_engine=square_wave_exact)
s += play_notes(bdbm_2, time_scale=0.2, rate=SF, wave_engine=square_wave_exact)
s += play_notes(bdbm_3, time_scale=0.2, rate=SF, wave_engine=square_wave_exact)
s += play_notes(bdbm_4, time_scale=0.2, rate=SF, wave_engine=square_wave_exact)
IPython.display.Audio(s, rate=SF)
Here is the generic "dithering" function to convert the audio to 1 bit:
def dither(waveform, rate):
    MAX_RATE = 96000
    values = distinct_values(waveform)
    voices = len(values) - 1
    if (rate * voices) > MAX_RATE:
        print('cannot dither: it would require too large a sampling rate')
        return None
    # verify that the levels are -voices, -voices + 2, ..., voices
    for n in range(0, voices):
        if not (-voices + 2 * n) in values:
            print('signal does not seem to be the sum of', voices, 'square waves')
            return None
    s = np.zeros(len(waveform) * voices)
    # now replace each sample with one period of a square wave with the appropriate duty cycle
    for n in range(0, len(waveform)):
        # start with a duty cycle of 100% (same sign as the sample)
        chunk = np.ones(voices)
        if waveform[n] < 0:
            chunk = -chunk
        # distribute the sign transitions evenly over the period
        flips = int((voices - abs(waveform[n])) / 2)
        chunk[0:2*flips:2] *= -1
        s[n*voices:(n+1)*voices] = chunk
    return s, rate * voices
Let's process the four-part audio, verify that it is two-level, and play it at the oversampled rate:
sd, drate = dither(s, SF)
is_two_level(sd)
IPython.display.Audio(sd, rate=drate)
When the number of voices increases, unless we can oversample more, we start to hear artifacts because the "fast" square wave segments are not long enough for the loudspeaker's implicit lowpass to converge to their mean. We can "help" the process by explicitly lowpass filtering the 1-bit signal; the result does sound better, although the higher harmonics of the signal are of course attenuated:
b, a = sp.butter(8, 0.15)
IPython.display.Audio(sp.lfilter(b, a, sd), rate=drate)